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Abstract We present a practical implementation of an optimal first-order method, 
due to Nesterov, for large-scale total variation regularization in tomographic recon- 
struction, image deblurring, etc. The algorithm applies to /i-strongly convex objective 
functions with L-Lipschitz continuous gradient. In the framework of Nesterov both 
jj. and L are assumed known - an assumption that is seldom satisfied in practice. 
We propose to incorporate mechanisms to estimate locally sufficient jj. and L during 
the iterations. The mechanisms also allow for the application to non-strongly convex 
functions. We discuss the iteration complexity of several first-order methods, inclu- 
ding the proposed algorithm, and we use a 3D tomography problem to compare the 
performance of these methods. The results show that for ill-conditioned problems 
solved to high accuracy, the proposed method significantly outperforms state-of-the- 
art first-order methods, as also suggested by theoretical results. 

Keywords Optimal first-order optimization methods ■ strong convexity • total 
variation regularization ■ tomography 
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1 Introduction 

Large-scale discretizations of inverse problems GUI arise in a variety of applications 
such as medical imaging, non-destructive testing, and geoscience. Due to the inher- 
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ent instability of these problems, it is necessary to apply regularization in order to 
compute meaningful reconstructions, and this work focuses on the use of total varia- 
tion which is a powerful technique when the sought solution is required to have sharp 
edges (see, e.g., II 1211331 for applications in image reconstruction). 

Many total variation algorithms have already been developed, such as time march- 
ing 11331 . fixed-point iteration [37 1, and various minimization-based methods such as 
sub-gradient methods Al TOl . second-order cone programming (SOCP) [ 18 1, duality- 
based methods [ 8 1 1 22 1 , and graph-cut methods ll9l [T6l . 

The difficulty of a problem depends on the linear operator to be inverted. Most 
methods are dedicated to denoising, where the operator is simply the identity, or 
possibly deblurring of a simple blur, where the operator is invertible and represented 
by a fast transform. For general linear operators with no exploitable structure, such 
as in tomographic reconstruction, the selection of algorithms is limited. Furthermore, 
the systems that arise in real-world tomography applications, especially in 3D, are 
so large that memory-requirements preclude the use of second-order methods with 
quadratic convergence. 

Recently, Nesterov's optimal first-order method |27 , 28 1 has been adapted to, and 
analyzed for, a number of imaging problems [ 14- , 38 1 . In [38 1 it is shown that Nes- 
terov's method outperforms standard first-order methods by an order of magnitude, 
but this analysis does not cover tomography problems. A drawback of Nesterov's al- 
gorithm (see, e.g., ifTol ) is the explicit need for the strong convexity parameter and 
the Lipschitz constant of the objective function, both of which are not available in 
practice. 

This paper describes a practical implementation of Nesterov's algorithm, aug- 
mented with efficient heuristic methods to estimate the unknown Lipschitz constant 
and strong convexity parameter. The Lipschitz constant is handled using backtrack- 
ing, similar to the technique used in JU. To estimate the unknown strong convexity 
parameter - which is more difficult - we propose a heuristic based on adjusting an 
estimate of the strong convexity parameter using a local strong convexity inequality. 
Furthermore, we equip the heuristic with a restart procedure to ensure convergence 
in case of an inadequate estimate. 

We call the algorithm UPN (Unknown Parameter Nesterov) and compare it with 
two versions of the well-known gradient projection algorithm; GP: a simple version 
using a backtracking line search for the stepsize and GPBB: a more advanced version 
using Barzilai-Borwein acceleration [2| and with the backtracking procedure from 
[19] • We also compare with a variant of the proposed algorithm, UPNo, where the 
strong convexity information is not enforced. This variant is similar to the FISTA 
algorithm J4) . We have implemented the four algorithms in C with a MEX interface 
to MATLAB, and the software is available from www2 . imm . dtu . dk/~pch/TVReg/ 

Our numerical tests demonstrate that the proposed method UPN is significantly 
faster than GP, and as fast as GPBB for moderately ill-conditioned problems, and 
significantly faster for ill-conditioned problems. Compared to UPNo, UPN is con- 
sistently faster, when solving to high accuracy. 

We start with introductions to the discrete total variation problem, to smooth and 
strongly convex functions, and to some basic first-order methods in Sections [2] [3] and 
|4] respectively. Section [5] introduces important inequalities while the new algorithm 
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is described in Section|6] The 3D tomography test problem is introduced in Section[7] 
Finally, in Section[8]we report our numerical tests and comparisons. 

Throughout the paper we use the following notation. The smallest singular value 
of a matrix A is denoted a m i n (A). The smallest and largest eigenvalues of a symmetric 
semi-definite matrix M are denoted by A m ; n (M) and Amax(Af). For an optimization 
problem, / is the objective function, x* denotes a minimizer, f* = f(x*) is the opti- 
mum objective, and x is called an e-suboptimal solution if f(x) — f* < e. 



2 The Discrete Total Variation Reconstruction Problem 

The Total Variation (TV) of a real function X(t) with t £ Q. C K p is defined as 

T(X) = [ \\VX{t)\\ 2 dt. (2.1) 
Jn 

Note that the Euclidean norm is not squared, which means that T(X) is non-differen- 
tiable. In order to handle this we consider a smoothed version of the TV functional. 
Two common choices are to replace the Euclidean norm of the vector z by either 
( I k 1 1 2 + P 2 ) 1 12 or the Huber function 

I tM\1 else - 

In this work we use the latter, which can be considered a prox-function smoothing 
1 28 1 of the TV functional [5 1; thus, the approximated TV functional is given by 

TAX) = ( <P T (VX)dt. (2.3) 
Ja 

In this work we consider the case t € K 3 . To obtain a discrete version of the TV 
reconstruction problem, we represent X(t) by an = m xnxl array X, and we let 
x = vec(X). Each element or voxel of the array X, with index j, has an associated 
matrix (a discrete differential operator) Dj 6 M? xN such that the vector Djx £ K 3 
is the forward difference approximation to the gradient at xj. By stacking all Dj we 
obtain the matrix D of dimensions 3A^ x A^: 



(D 



D = 



(2.4) 



We use periodic boundary conditions in D, which ensures that only a constant x has 
a TV of 0. Other choices of boundary conditions could easily be implemented. 

When the discrete approximation to the gradient is used and the integration in 
\2.~i) is replaced by summations, the discrete and smoothed TV function is given by 

N 

7V(jc) = £ 4>r(D jX ). (2.5) 
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The gradient VT T (x) £ Mr of this function is given by 

N 

V7V(jc) = £ DjD ; x/max{T, \\Djx\\ 2 }. (2.6) 
7=1 

We assume that the sought reconstruction has voxel values in the range [0, 1], 
so we wish to solve a bound-constrained problem, i.e., having the feasible region 
Q = {x e K. N I < xj < 1}. Given a linear system Ax « b where A e R MxN and 
A' = mnl, we define the associated discrete TV regularization problem as 

x* = argmin (j>(x), <j)(x) = \ \\Ax - b\\\ + a T T (x), (2.7) 

xeQ 

where a > is the TV regularization parameter. This is the problem we want to 
solve, for the case where the linear system of equations arises from discretization of 
an inverse problem. 

3 Smooth and Strongly Convex Functions 

To set the stage for the algorithm development in this paper, we consider the convex 
optimization problem min^ e g/(x) where / is a convex function and Q is a convex 
set. We recall that a continuously differentiable function / is convex if 

/W>/(y)+V/(y) r (x-y), Vx,yeR N . (3.1) 

Definition 3.1 A continuously differentiable convex function / is said to be strongly 
convex with strong convexity parameter /I if there exists a jX > such that 

f(x)>f(y) + yf(y) T (x-y) + ^\\x~y\\l Vjt,yeR*. (3.2) 

Definition 3.2 A continuously differentiable convex function / has Lipschitz contin- 
uous gradient with Lipschitz constant L, if 

f{x)<f{y)+Vf(y) T {x-y) + \L\\x~y\\l, Vijet". (3.3) 

Remark 3.1 The condition ( |3.3| > is equivalent ll27l Theorem 2.1.5] to the more stan- 
dard way of defining Lipschitz continuity of the gradient, namely, through convexity 
and the condition \\Vf{x) - V/(y)|| 2 < L||*-y|| 2 , V*,y € R N . 

Remark 3.2 Lipschitz continuity of the gradient is a smoothness requirement on /. 
A function / that satisfies ( |3.3| ) is said to be smooth, and L is also known as the 
smoothness constant. 

The set of functions that satisfy ( |3.2| > and ( |3.3| is denoted J^x- It is clear that 
H < L and also that if > an d < then / <= J-fi u L { =^ / € J-^.Lq- Given 
fixed choices of jj. and L, we introduce the ratio Q = L/fi (sometimes referred to as 
the "modulus of strong convexity" [25 1 or the "condition number for f" ll27ll ) which 
is an upper bound for the condition number of the Hessian matrix. The number Q 
plays a major role for the convergence rate of optimization methods we will consider. 
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Lemma 3.1 For the quadratic function f(x) — \ \\Ax — b\fa with A G R MxN we have 
- _ i (*T*\ _ / ffmin(A) 2 if rank(A)=Af, 



Z- = M = X^(A'A) = { ™vv — (3 .4) 

ana? ;/rank(A) = N fnen 2 — K"(A) 2 , the square of the condition number of A. 

Proof Follows from f(x) = f(y) + (Ay - b) T A(x - y) + \ (x - y) T A T A(x - y), the 
second order Taylor expansion of / about y, where equality holds for quadratic /. □ 



Lemma 3.2 For the smoothed TV function \2.5) we have 

L=\\D\$/t, ji = 0, (3.5) 
where \\D\\\ < 12 in the 3D case. 

Proof The result for L follows from ll28l Thm. 1] since the smoothed TV functional 
can be written as If5lfl4ll 

T T (x) =max\ u T Dx- ^ \\u\\l : |k|| 2 < 1, Vi= l,...,Af) 

with m = (mj', . . . 7 uli) T stacked according to D. The inequality ||D||2 < 12 follows 
from a straightforward extension of the proof in the Appendix of fl4l . For fi pick 
y = ae G R N and x = fie G R N , where e = (!,... ,l) T , and a ^ J3 G E. Then we get 
T T (x) = r T (y) = 0, Vr T (y) = and obtain 

\pL\\x-y\\ 2 2 < T r (x)-T z (y)-VT T (y) T (x-y)=0, 

and hence jj, =0. □ 



Theorem 3.1 For the function <j)(x) defined in ( |2.7| > we nave a strong convexity pa- 
rameter jx = X m i n (A T A) and Lipschitz constant L= ||A||2 + 0f ||D|||/t. //rank(A) <N 
then jj, = 0, otherwise jl — (7 m i n (A) 2 > ant/ 

where k(A) = ||A||2/<7 m m(A) is the condition number of A. 

Proof Assume rank(A) = N and consider f(x) — g(x) + h(x) with g G Fii g ,L g and 
h G J-^,,L h - Then / G T il; j j p w here \if = \i s -\- and Lf = L g +L/,. From /Z/- and L/ 
and using Lemmas 3.1 and 3.2 withg(jc) = j||Ajc — b\fe andn(jc) = CcT T (x) we obtain 
the condition number tor given in ( |3.6| l. If rank (A) < N then the matrix A T A has at 
least one zero eigenvalue, and thus jl = 0. □ 



Remark 3.3 Due to the inequalities used to derive ( |3.6| l, there is no guarantee that the 
given jl and L are the tightest possible for 0. For rank(A) < N there exist problems 
for which the Hessian matrix is singular and hence jl = 0, but we cannot say if this is 
always the case. 
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4 Some Basic First-Order Methods 

A basic first-order method is the gradient projection method of the form 

x(* +1 )=P Q (/*)-0 fc V/(*«)), jfc = 0,l,2,.... (4.1) 

The following theorem summarizes the convergence properties. 

Theorem 4.1 Let f G F^x> ®k — 1/^ andx* G Q be the constrained minimizer of f, 
then for the gradient method ( |4.1| > we have 

f(xW)-f<~\\xW-x*f 2 . (4.2) 

Moreover, if pi =^ f/ien 

/(^)-r<(i-f)"(/(x (0) )-r). (4.3) 

Proof The two bounds follow from 13611 and |[25l §7.1.4], respectively. □ 

To improve the convergence of the gradient (projection) method, Barzilai and 
Borwein [2] suggested a scheme in which the step 0jtV/(jc®) provides a simple and 
computationally cheap approximation to the Newton step (V 2 /(x^')) _1 V/(x^). For 
general unconstrained problems with / G F^x> possibly with jj. = 0, non-monotone 
line search combined with the Barzilai-Borwein (BB) strategy produces algorithms 
that converge l32l ; but it is difficult to give a precise iteration complexity for such 
algorithms. For strictly quadratic unconstrained problems the BB strategy requires 
O (Qloge -1 ) iterations to obtain an e-suboptimal solution ifTSIl . In ifTTl it was ar- 
gued that, in practice, O (Qloge -1 ) iterations "is the best that could be expected". 
This comment is also supported by the statement in l27l p. 69] that all "reasonable 
step-size rules" have the same iteration complexity as the standard gradient method. 
Note that the classic gradient method ( |4.1[ ) has 0(L/e) complexity for / £ Fox- To 
summarize, when using the BB strategy we should not expect better complexity than 
0(L/e) for / £ Fox, and O (gloge- 1 ) for / e T^u 

In Algorithm [TJ we give the (conceptual) algorithm GPBB, which implements 
the BB strategy with non-monotone line search [ 6 , 39 1 using the backtracking proce- 
dure from [19 1 (initially combined in (32]). The algorithm needs the real parameter 
(7 G [0, 1] and the nonnegative integer K, the latter specifies the number of iterations 
over which an objective decrease is guaranteed. 

An alternative approach is to consider first-order methods with optimal complex- 
ity. The optimal complexity is defined as the worst-case complexity for a first-order 
method applied to any problem in a certain class [25 ,27 1 (there are also more tech- 
nical aspects involving the problem dimensions and a black-box assumption). In this 
paper we focus on the classes Fq.l and T^^l- 

Recently there has been a great deal of interest in optimal first-order methods 
for convex optimization problems with / e .Fox [3,35]. For this class it is possible 
to reach an e-suboptimal solution within 0(y // L/e) iterations. Nesterov's methods 
can be used as stand-alone optimization algorithm, or in a composite objective setup 
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Algorithm 1: GPBB 



input :x(°\K 



output: 

1 O = l ; 

2 for £ = 0,1,2,. 



do 



// BB strategy 
if k >0 then 

_ 0A ' (jcW-^-'J.V/^Wj-V/^*- 1 ))) ' 

j8 <- 0.95 ; 

x<- P a (xW - p8 k Vf(xW)) ; 
/ <- maxi/C^)),/^^ 1 )), . . .,/(#-*))} ; 
while /(Jc) > /- a V/(xW) r (xW -x) do 
/3^j3 2 ; 

jc<-P c (jcW-/3e t V/(*«)); 



Bl l29l[35l . in which case they are called accelerated methods (because the designer 
violates the black-box assumption). Another option is to apply optimal first-order 
methods to a smooth approximation of a non-smooth function leading to an algorithm 
with O (1/e) complexity [28]; for practical considerations, see I5 HT41 . 

Optimal methods specific for the function class Tn^ with jj, > are also known 
B26II27I : see also ||29l for the composite objective version. However, these methods 
have gained little practical consideration; for example in ||29l all the simulations are 
conducted with /I = 0. Optimal methods require O (-^/filoge -1 ) iterations while the 
classic gradient method requires O (Q log e 1 ) iterations 12511271 . For quadratic prob- 
lems, the conjugate gradient method achieves the same iteration complexity as the 
optimal first-order method ll25l . 

In Algorithm [2] we state the basic optimal method Nesterov ll27l with known ji 
and L; it requires an initial 9q > Wjl/L. Note that it uses two sequences of vectors, 
T- k ' and y( k \ The convergence rate is provided by the following theorem. 

Theorem 4.2 Iff e F n , L , 1 > 9 Q > y/Jl/L, and y = then for algorithm 

Nesterov we have 

f<&)-r < ^ (f(x {0) )-f + ^\\x {0) -x*\\l) ■ (4-4) 

Moreover, if \l ^ 

/(*«) ~f < (l- yfl )* (/(- (0) ) -f + f ll- (0) -All) ■ (4.5) 



Proof See l27l (2.2.19), Theorem 2.2.3] and Appendix |A| for an alternative proof. □ 
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Algorithm 2: Nesterov 



input : x(°\ fi, L, O 
output: x( k+ V 

1 y^^x^; 

2 for k = 0,1, 2,... do 

3 x^ +1 ) ^PQ(y {k) ~L- l Vf{y( k ))) ; 

4 e k+i <- positive root of 9 2 = (1 - 0)% 2 + £0 

6 y^ 1 ) <-Jc( fe+1 )+j3 ft (^ +1 ) -iW) ; 



Except for different constants Theorem 4.2 mimics the result in Theorem |4.1| 
with the crucial differences that the denominator in ( |4.4| i is squared and fi/L in ( |4.5[ ) 
has a square root. Comparing the convergence rates in Theorems |4.1|an d |4.2| we see 
that the rates are linear but differ in the linear rate, Q~ l and \J Q~ l , respectively. 
For ill-conditioned problems, it is important whether the complexity is a function of 
Q or y/Q, see, e.g., Il25l §7.2.8]. This motivates the interest in specialized optimal 
first-order methods for solving ill-conditioned problems. 



5 First-Order Inequalities for the Gradient Map 

For unconstrained convex problems the (norm of) the gradient is a measure of how 
close we are to the minimum, through the first-order optimality condition, cf. Q. 
For constrained convex problems min xG g/(x) there is a similar quantity, namely, the 
gradient map defined by 

G v (x) = v(x-P Q (x-v- 1 Vf(x))). (5.1) 

Here V > is a parameter and V can be interpreted as the step size of a gradient 
step. The function Pq is the Euclidean projection onto the convex set Q [27 1. The 
gradient map is a generalization of the gradient to constrained problems in the sense 
that if Q = R N then G v (x) = Vf(x), and the equality G v (x*) = is a necessary and 
sufficient optimality condition |36|. In what follows we review and derive some im- 
portant first-order inequalities which will be used to analyze the proposed algorithm. 
We start with a rather technical result. 

Lemma 5.1 Let f G Fn^fixxG Q,y<E R N , and setx + — Pq{j — L _1 V 'fiy)), where 
p. and L are related to x,y and x + by the inequalities 

f(x)>f(y)+yf(yf(x-y) + lfi\\x-y\\i (5.2) 

f(x + ) < fiy) + V/(y) r (x+ -y) + ±L||x+ -y\\l (5.3) 

Then 

f(x + )<f(x)+G L (y) T (y~x)- ^llG^ll- \fl\\y-x\\i (5.4) 
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Proof Follows directly from 11271 Theorem 2.2.7]. □ 



Note that if / € Tu.u then in Lemma 5.1 we can always select jl = and L = L 



to ensure that the inequalities \5.2) and \53) are satisfied. However, for specific x, y 
and x + , there can exist ft > jl and L < L such that ( |5.2| > and ( |5.3| ) hold. We will use 
these results to design an algorithm for unknown parameters fi and L. 

The lemma can be used to obtain the following lemma. The derivation of the 
bounds is inspired by similar results for composite objective functions in \29\, and 
the second result is similar to [27, Corollary 2.2.1]. 

Lemma 5.2 Let f e fix ye R N , and setx + = P Q (y — L~ l Vf(y)). Let p, andt 
be selected in accordance with \5.2\ and \53\ respectively. Then 



Ify € Q then 

2 



kfi\\y-x*\\2<\\G- L (y)\\2. (5.5) 



ll-'WGiiy)^ < f{y)-f{x + ) < m-f. (5.6) 



Proof From Lemma 5.1 with i=i*we use f{x + ) > f* and obtain 

kfi\\y-x*\\ 2 2 < G L {yf{y-x*) \L-'\\G L {y)\\l < ||Gi(y)||2|b-x1| 2) 
and (|5.5[) follows; Eq. (|5.6[) follows from Lemma 5.1 using y — x and /* < f(x + ). □ 



As mentioned in the beginning of the section, the results of the corollary say that 
we can relate the norm of the gradient map at y to the error | |y — jc* 1 1 2 as well as to 
f(y) — /*. This motivates the use of the gradient map in a stopping criterion: 

HGzMll2<£, (5.7) 

where y is the current iterate, and L is linked to this iterate using ([53). The parameter 
e is a user-specified tolerance based on the requested accuracy. Lemma 5.2 is also 
used in the following section to develop a restart criterion to ensure convergence. 



6 Nesterov's Method With Parameter Estimation 

The parameters fi and L are explicitly needed in Nesterov, but it is much too expen- 
sive to compute them explicitly; hence we need a scheme to estimate them during the 
iterations. To this end, we introduce the estimates ji^ and L^ of jl and L in each iter- 
ation k. We discuss first how to choose L^, then jj.^, and finally we state the complete 
algorithm UPN and its convergence properties. 



To ensure convergence, the main inequalities (A. 61 and (A.7i must be satisfied 



Hence, according to Lemma [5T| we need to choose Lf, such that 

/(*(*+!)) < + V/(y«) -y [k} ) + \L k \\x^ -y^\\l (6.1) 



This is easily accomplished using backtracking on L^ 2] . The scheme, BT, takes the 
form given in Algorithm [3] where pi > 1 is an adjustment parameter. If the loop 
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Algorithm 3: BT 



input :y,L 
output: x,L 

1 L^L; 

2 x ^P Q (y-L- 1 Vf(y)) ■ 

3 while f(x)>f(y) + Vf(y) T {x- 



-y). 



m\x- 



■y|||do 



L <— PlL ; 
x^P Q (y- 



L- l Vf(y)) ■ 



is executed «bt times, the dominant computational cost of BT is «bt + 2 function 
evaluations and 1 gradient evaluation. 



According to (A. 7 1 with Lemma 5.1 and (A. 8 i, we need to select the estimate jXk 



such that < where satisfies 

/(**) >f(y {k) ) + Vf(y {k) )\x*-y^) + \riV~y {k) \\l (6-2) 

However, this is not possible because x* is, of course, unknown. To handle this prob- 
lem, we propose a heuristic where we select pik such that 

/(*») > fiy^) + Vf(yW) T (xW -,(*)) + y k \\ x W -y®\\l (6.3) 

This is indeed possible since and yW are known iterates. Furthermore, we want 
the estimate to be decreasing in order to approach a better estimate of /i. This can 
be achieved by the choice 

Al, = min{^-i,M(x« (6.4) 

where we have defined the function 

f f{x)-f(y)-Vf(y) T (x-y) 



M(x,y) = { i\\x-y\\2 " (6.5) 



if x^y, 
else. 



In words, the heuristic chooses the largest fi^ that satisfies ( |3.2[ > for x^ and yW, as 
long as fib is not larger than /4_ i . The heuristic is simple and computationally in- 
expensive and we have found that it is effective for determining a useful estimate. 
Unfortunately, convergence of Nesterov equipped with this heuristic is not guaran- 
teed, since the estimate can be too large. To ensure convergence we include a restart 
procedure RUPN that detects if jU* is too large, inspired by the approach in l29l §5.3] 
for composite objectives. RUPN is given in Algorithm[4] 

To analyze the restart strategy, assume that jj,, for all z — 1 , . . . , k are small enough, 
i.e., they satisfy jj,, < jj,* for i = 1 , . . . , k, and ji^ satisfies 

/(**) >f(x {Q) )+Vf(x^) T (x*-x^) + ^k\\x*-x m \\l (6.6) 
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Algorithm 4: RUPN 



1 n =0i(0iLi-mi)/(1-0i); 

2 if jU^ 7^ and inequality ( |6.9| l not satisfied then 
3 

p M /i*, e); 



abort execution of UPN; 
restart UPN with input (x^ +1 ) 



When this holds we have the convergence result (using (|A.9[)) 



/(*' 



(*+i)\ 



/*<j7(i_vS7L;)(/(^)-r- 



|nll* (1) 



-'111 



(6.7) 



We start from iteration k = 1 for reasons which will presented shortly (see Ap- 
pendix [A] for details and definitions). If the algorithm us es a p rojected gradient step 
from the initial x^ to obtain the rightmost factor of (6.7 1 can be bounded as 



f(x m )-f+ln\\x m -x*t2 

2 1 ,2k 



inlk (0) - 



.*l|2 



< 



|Gl (x^)II 



(6.8) 



Here we used Lemma |5.1| and the fact that a projected gradient step reduces the 



Euclidean distance to the solution [27, Theorem 2.2.8]. Using Lemma 5.2 
at the bound 



we arrive 



i£ _1 I 

2 L i+l I 



(x {k+l) )\\l 




(Oh 



(6.9) 



If the algorithm detects that ( |6.9[ ) is not satisfied, it can only be because there was 
at least one /!,- for i — l,...,k which was not small enough. If this is the case, we 
restart the algorithm with a new p, p^jJ-k, where < p^ < 1 is a parameter, using 
the current iterate ;t;^' +1 ) as initial vector. 

The complete algorithm UPN (Unknown-Parameter Nesterov) is given in Algo- 
rithm [5] UPN is based on Nestero y's o ptimal method where we have included back- 
tracking on Lk and the heuristic ( 6.4 1. An initial vector x^ and initial parameters 
p, > jJ, and L<L must be specified along with the requested accuracy e. The changes 
from Nesterov to UPN are at the following lines: 



1: Initial projected gradient step to obtain the bound (6.8 i and thereby the bound 

( |6.9| l used for the restart criterion. 
5: Extra projected gradient step explicitly applied to obtain the stopping criterion 

\\G Lk Jx^)\\ 2 <e. 



6,7: Used to relate the stopping criterion in terms of £ to e, see Appendix B.3 



8: The heuristic choice of jXk in ( 6.4 1 
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Algorithm 5: UPN 

input : ,p.,L,e 
output: orx( k+1 "> 

1 [x( l \Lo}<~BT(x(°\L) ■ 

2 Hq = p., y (1) f-i (1) , 0i <- vW^o ; 

3 for k= 1,2,... do 

4 [Z+'l.L^BTfy^Ln); 

5 [^+ 1 ),L, +1 ]^BT(x^+ 1 ),L,); 

6 if 1 1 G- Lk+ 1 (x< A: + 1 ))|| 2 <e then abort, return #+ 1 ) ; 

7 if ||G Llfe Cy (fc) )||2 < e then abort, return ; 

8 At*^min{^-i,M(xW,y«)}; 

9 RUPN; 

10 <- positive root of 2 = (1 - 0)0% + (n k /L k ) ; 

11 p k ^Q k {l-0 k )/(Ql + Q k+1 ); 

12 y^+V + ; 



10: The restart procedure for inadequate estimates of /x. 

We note that in a practical implementation, the computational work involved in 
one iteration step of UPN may - in the worst case situation - be twice that of one 
iteration of GPBB, due to the two calls to BT. However, it may be possible to imple- 
ment these two calls more efficiently than naively calling BT twice. We will instead 
focus on the iteration complexity of UPN given in the following theorem. 

Theorem 6.1 Algorithm UPN, applied to f G Jui under conditions p. > fj., L < L, 
£ = yf (ju/2)g, stops using the gradient map magnitude measure and returns an E- 
suboptimal solution with iteration complexity 



0[VQlogQ)+0\^Qloge- L ). (6.10) 
Proof See Appendix [B] □ 



The term O (y/Q log Q) in ( 6. 10 1 follows from application of several inequalities 
invo lving the problem dependent parameters fi and L to obtain the overall bound 
(6.9 1. Algorithm UPN is suboptimal since the optimal complexity is O (y/QlogE~ ) 



but it has the advantage that it can be applied to problems with unknown ji and L. 



7 The 3D Tomography Test Problem 

Tomography problems arise in numerous areas, such as medical imaging, non-destruc- 
tive testing, materials science, and geophysics [21 ,24,30]. These problems amount to 
reconstructing an object from its projections along a number of specified directions, 
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Fig. 7.1 Left: Two orthogonal slices through the 3D Shepp-Logan phantom discretized on a 43 3 grid used 
in our test problems. Middle: Central horizontal slice. Right: Example of solution for a = 1 and T = 10~ 4 . 
A less smooth solution can be obtained using a smaller a. Original voxel/pixel values are 0.0, 0.2, 0.3 and 
1 .0. Color range in display is set to [0.1 ,0.4] for better constrast. 



and these projections are produced by X-rays, seismic waves, or other "rays" pene- 
trating the object in such a way that their intensity is partially absorbed by the object. 
The absorbtion thus gives information about the object. 

The following generic model accounts for several applications of tomography. We 
consider an object in 3D with linear attenuation coefficient X(t), with t £ SI C R 3 . 
The intensity decay £>, of a ray along the line £j through D. is governed by a line 
integral, 

h = log(/ //0 = / X(t)dt = b u (7.1) 

where Iq and are the intensities of the ray before and after passing through the 
object. When a large number of these line integrals are recorded, then we are able to 
reconstruct an approximation of the function X(t). 

We discretize the problem as described in Section[2] such that X is approximated 
by a piecewise constant function in each voxel in the domain Q. = [0,1] x [0,1] x [0,1]. 
Then the line integral along is computed by summing the contributions from all the 
voxels penetrated by i[. If the path length of the ith ray through the jth voxel is 
denoted by a, ; -, then we obtain the linear equations 

N 

ajjXj = bj, i=l,...,M, (7.2) 

7=1 

where M is the number of rays or measurements and N is the number of voxels. This 
is a linear system of equations Ax = b with a sparse coefficient matrix A € M MxA '. 

A widely used test image in medical tomography is the "Shepp-Logan phan- 
tom," which consists of a number superimposed ellipses. In the MATLAB function 
siiepplogan3d |34| this 2D image is generalized to 3D by superimposing ellip- 
soids instead. The voxels are in the range [0, 1], and Fig. |7.1| shows an example with 
43 x 43 x 43 voxels. 

We construct the matrix A for a parallel-beam geometry with orthogonal projec- 
tions of the object along directions well distributed over the unit sphere, in order 
to obtain views of the object that are as independent as possible. The projection di- 
rections are the direction vectors of so-called Lebedev quadrature points on the unit 
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sphere, and the directions are evenly distributed over the sphere; we use the MATLAB 
implementation getLebedevSphere OTTl . For setting up the tomography system ma- 
trix for a parallel beam geometry, we use the Matlab implementation tomobox l23l . 



8 Numerical Experiments 



This section documents numerical experiments with the four methods UPN, UPNo, 
GP and GPBB applied to the TV regularization problem (|2.7|i. We use the two test 



problems listed in Table 8.1 which are representative across a larger class of prob- 
lems (other directions, number of projections, noise levels, etc.) that we have run 
simulations with. The smallest eigenvalue of A T A for Tl is 2.19 • 10~ 5 (as computed 
by Matlab's eigs), confirming that rank(A) = N for Tl. We emphasize that this 
computation is only conducted to support the analysis of the considered problems 
since - as we have argued in the introduction - it carries a considerable computational 
burden to compute. In all simulations we create noisy data from an exact object x eX act 
through the forward mapping b=Ax exact +e, subject to additive Gaussian white noise 
of relative noise level ||e||2/||fc||2 = 0.01. As initial vector for the TV algorithms we 
use the fifth iteration of the iterative conjugate gradient method applied to the least 
squares problem. 

Table 8.1 Specifications of the two test problems; the object domain consists of m xnxl voxels and each 
projection is a p x p image. Any zero rows have been purged from A. 



Problem 


m = n = I 


P 


projections 


dimensions of A 


rank 


Tl 


43 


63 


37 


99361 x 79507 


= 79507 


T2 


43 


63 


13 


33937 x 79507 


< 79507 



To investigate the convergence of the methods, we need the true minimizer x* 
with <j)(x*) = <p*, which is unknown for the test problem. However, for comparison 
it is enough to use a reference solution much closer to the true minimizer than the 
iterates. Thus, to compare the accuracy of the solutions obtained with the accuracy 
parameter e, we use a reference solution computed with accuracy (e • 10~ 4 ), and with 
abuse of notation we use x* to denote this reference solution. 

We compare the algorithm UPN with GP (the gradient projection method ( |4.1[ ) 
with backtracking line search on the step size), GPBB and UPNo- The latter is UPN 
with = for all i = 0, • • • , k and 9\ = 1 . The algorithm UPNo is optimal for the class 
J~o,l and can be seen as an instance of the more general accelerated/fast proximal 
gradient algorithm (FISTA) with backtracking and the non-smooth term being the 
indicator function for the set Q, see [4, 3 29] and the overview in [35]. 



8.1 Influence of a and x on the convergence 

For a given A the theoretical modulus of strong convexity given in ( |3.6| ) varies only 
with a and x. We therefore expect better convergence rates (|4.3[) and (|4.5|l for smaller 



a and larger x. In Fig. 8.1 we show the convergence histories for Tl with all combi- 
nations of a = 0.01, 0.1, 1 and x = 10~ 2 , 10~ 4 , 10~ 6 . 
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a = 0.01, x = 0.01, ah =1e+00 a = 0.1 , x = 0.01, alx =1 e+01 a= 1, x = 0.01, alx =1e+02 




a = 0.01, t = 0.0001, ah =1e+02 a = 0.1, x = 0.0001 , ah =1e+03 a = 1, x = 0.0001, ah =1e+04 




a = 0.01, x = 1e-06, alx =1e+04 a = 0.1, x= 1e-06, ah =1e+05 a = 1, x = 1e-06, alx =1e+06 




Iteration k Iteration k Iteration k 



Fig. 8.1 Convergence histories (0(jc w ) - 0*)/0* vs - k for T1 with « = 0-01, 0.1 and 1 and X — 10~ 2 , 
1(T 4 and 1(T 6 . 

For low ajx ratios, i.e., small condition number of the Hessian, GPBB and GP 
requires a comparable or smaller number of iterations than UPN and UPNo- As a/x 
increases, both GPBB and GP exhibit slower convergence, while UPN is less af- 
fected. In all cases UPN shows linear convergence, at least in the final stage, while 
UPNo shows sublinear convergence. Due to these observations, we consistently ob- 
serve that for sufficiently high accuracy, UPN requires the lowest number of itera- 
tions. This also follows from the theory since UPN scales as C(loge _1 ), whereas 
UPNo scales at a higher complexity of 0(V £ _1 ). 

We conclude that for small condition numbers there is no gain in using UPN com- 
pared to GPBB. For larger condition numbers, and in particular if a high-accuracy 
solution is required, UPN converges significantly faster. Assume that we were to 
choose only one of the four algorithms to use for reconstruction across the condition 
number range. When UPN requires the lowest number of iterations, it requires signi- 
ficantly fewer, and when not, UPN only requires slightly more iterations than the best 
of the other algorithms. Therefore, UPN appears to be the best choice. Obviously, the 
choice of algorithm also depends on the demanded accuracy of the solution. If only a 
low accuracy, say (<j) W — = 10~ 2 is sufficient, all four methods perform more 

or less equally well. 



8.2 Restarts and jU* and histories 

To ensure convergence of UPN we introduced the restart functionality RUPN. In 
practice, we almost never observe a restart, e.g., in none of the experiments reported 
so far a restart occurred. An example where restarts do occur is obtained if we in- 
crease a to 100 for Tl (still T = 10~ 4 ). Restarts occur in the first 8 iterations, and 
each time pit is reduced by a constant factor of = 0.7. In Fig. 8.2 left, the /x^ 
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oc = 1 00, x = 0.0001 , oc/x =1 e+06 a = 1 , x = 0.0001 , a/x =1 e+04 

10 s , . . . , 10° 

10* 
10 
10 : 

1 °0 500 1 000 1500 2000 '" 500 1000 1500 2000 

Iteration k Iteration k 

Fig. 8.2 The fl k , L k histories for Tl. Left: (X — 100 and X — 10~ 4 . Right: a = 1 and T = 10~ 4 . 




and histories are plotted vs. k and the restarts are seen in the zoomed inset as the 
rapid, constant decrease in /ij. From the plot we also note that after the decrease in jj.^ 
and an initial increase in L^, both estimates are constant for the remaining iterations, 
indicating that the heuristics determines sufficient values. 

For comparison the and histories for Tl with a — I and x — 10~ 4 are seen 



in Fig. 8.2 right. No restarts occurred here, and jUj decays gradually, except for one 
final jump, while Lf, remains almost constant. 



8.3 A non-strongly convex example 

Test problem T2 corresponds to only 13 projections, which causes A to not have full 
column rank. This leads to X m [ n (A T A) = 0, and hence §{x) is not strongly convex. 
The optimal convergence rate is therefore given by ( |4.4| >; but how does the lack of 
strong convexity affect UPN, which was specifically constructed for strongly convex 
problems? UPN does not recognize that the problem is not strongly convex but simply 
relies on the heuristic \6A\ at the £th iteration. We investigate the conver genc e by 
solving T2 with a = 1 and % = 10~ 4 . Convergence histories are given in Fig. 8.3 left. 
The algorithm UPN still converges li nearl y, although slightly slower than in the Tl 
experiment (a = 1,T = 10~ 4 ) in Fig. 8.1 The algorithms GP and GPBB converge 



much more slowly, while at low accuracies UPNo is comparable to UPN. But the 
linear convergence makes UPN converge faster for high accuracy solutions. 



8.4 Influence of the heuristic 

An obvious question is how the use of the heuristic for estimating ji affects UPN 
compared to Nesterov, where fi (and L) are assumed known. From Theorem |3.1| 
we can compute a strong convexity parameter and a Lipschitz parameter for <j)(x) 
assuming we know the largest and smallest magnitude eigenvalues of A T A. Recall 



that these jj, and L are not necessarily the tightest possible, according to Remark 3.3 
For Tl we have computed Amax(A r A) = 1.52- 10 3 and Kin{A T A) = 2.19 ■ 10~ 5 (by 
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means of eigs in Matlab). Using a = 1, X = 10 4 and ||D||2 < 12 from Lemma 
I3.2l we take 

M = ^min(A r A) =2.19- 1(T 5 , L = A max (A 7 A) + 12- = 1.22- 10 s , 

T 

and solve test problem Tl using UPN with the heuristics switched off in favor of these 
true strong convexity and Lipschitz parameters. Convergence histories are plotted in 
Fig. [83] right. 

The convergence is much slower than using UPN with the heuristics switched 
on. We ascribe this behavior to the very large modulus of strong convexity that arise 
from the true ji and L. It appears that UPN works better than the actual degree of 
strong convexity as measured by fi, by heuristically choosing in each step a ju^ that 
is sufficient locally instead of being restricted to using a globally valid fi. 



9 Conclusion 

We presented an implementation of an optimal first-order optimization algorithm for 
large-scale problems, suited for functions that are smooth and strongly convex. While 
the underlying algorithm by Nesterov depends on knowledge of two parameters that 
characterize the smoothness and strong convexity, we have implemented methods 
that estimate these parameters during the iterations, thus making the algorithm of 
practical use. 

We tested the performance of the algorithm and compared it with two variants of 
the gradient projection algorithm and a variant of the FISTA algorithm. We applied 
the algorithms to total variation-regularized tomographic reconstruction of a generic 
threedimensional test problem. The tests show that, with regards to the number of ite- 
rations, the proposed algorithm is competitive with other first-order algorithms, and 
superior for difficult problems, i.e., ill-conditioned problems solved to high accuracy. 
Simulations also show that even for problems that are not strongly convex, in prac- 
tice we achieve the favorable convergence rate associated with strong complexity. 
The software is available as a C-implementation with an interface to MATLAB from 
www 2 . imm . dtu . dk/~pch/TVReg71 
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A The Optimal Convergence Rate 

Here we provide an analysis of an optimal method for smooth, strongly convex functions without the use 
of estimation functions as in 1271 . This approach is similar to the analysis of optimal methods for smooth 
functions in [35 36 1. The motivation for the following derivations is to introduce the iteration dependent 
L k and fl k estimates of L and [l. This will support the analysis of how L k and n k should be selected. We 
start with the following relations to the "hidden" supporting variables zW and y k 1271 pp. 73-75, 89], 

= M (z «_ v W), (A.1) 
Hk+\ 

jm = (i - e k )% + e k n k = e 2 k L k , y k+l z {k+V) = (1 - e k ) 7kZ {k) + fW*> - e k c Lt (/*>). (a.2) 

In addition we will make use of the relations 

l±i| k (*+l)_ y (*)||2 = 1 ((l-ft^H^-yWlll 

2 2.y k+ i \ 

-29,(1 -e k ) 7k G Ll (y^) T (z^-y^) + ei\\G Lk (y^mi), (A.3) 

* (l-ftftg- £zgli*ft (A .4) 
2 2^ + i 27^1 



which originate from l |A.2) . We will also later need the relation 

= (i- ft)f lk w -j^III - ^ -y w [|2 + (-r* + u ( * +1) + (i - e t )r« w + eW*^ (y (t) -**) 

= ((1- - ^ + ftw) (y«) V + (1 - - ^± (*<*+») V> 

+ %+i(z ( * +1) )V - (i - - ftwt(y ( * ) )V 

= (i - %)f - - ^ - 

+ ^ (lb w - (^)V) + (d-%)f - 9 -f) (y«)V*> 

= (i+e*)f |k«-x»||l-^|| Z «-x»||l+ft^ lb w -^|g. (A.5) 



where we again used (A.2) . We can now start the analysis of the algorithm by considering the inequality 
in Lemma [5~T1 

(1 - 6 k )f(x^) < (1 - e k )f(x (k) ) + (1 - e k )G Lt (y^) T (y^ -jfi)) - (1 - e k )~\\G Lk (yW)\\l (A.6) 

LL k 

where we have omitted the strong convexity part, and the inequality 

lL k - 2 

Adding these bounds and continuing, we obtain 

f(* (k+,) ) < (i-e k )f(xM) + (i-e k )G Lk (yW) T (y (k) -x {k) ) 



+ 6 k f + 9 t G Li (yW) r (y« -**) - eAx*-yWg - ~||G it (y«)||I 

2 2i,j. 



(1 - + - fcJ^ft^W ->•«) 
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7k+i 

+ e k f + e k c Lk (y») V*> - x *)-e k %\s- y « |g--L || Gis (y« ) | 



, (l-9t)0tK^ n (k) tk)\\2 
+ 2^ I|Z > 1,2 

(i - %)/(^) + (i - e k )^.(h t (y»W > -y (k) ) 

Yk+l 

+ e k f + e k c Lt (>■(*» ) T (y« - x*) - e t ^ - y« || 2 ■ 

V 2 2 %+i 



2L, 



\\G Lk (y {k] )\\l 



(i-e*) 2 ^)lk w -^ll 



= (1 - + (1 - 6*)f |k« -y«||I - ^||^+» -y«||| 

+ e k f + e t G it (yW ) T (yW - x*) - ft £ Hjc* - y« || 2 
= (i - + e*/* - ||**-y«||i 



where we have used (ATT}, a trivial inequality, (A.4) , (A3), ( |A.2) , and (A3}. If < then 

_ r + ||# + d -,*n 2 < (i - e k) (/(*«) - r + f ik w 

in which case we can combine the bounds to obtain 



(A.8) 



/(*«)-/*+! Ik w -*l!< Ilia-ft)) (/(* (0) )-/*+f lk (0) -**llij. (A.9) 

where we have also used jc' ' = y' ' and \A. 1\ to obtain jc' ' = z' ' . For completeness, we will show why 
this is an optimal first-order method. Let jl k = fl£ = fl and L k = L. If yo > jl then using (A. 2) we obtain 



Yk+l > M and S k > \fuTL = y/Q~ l . Simultaneously, we also have II*-n (1 - s k) < 7 — — rr (HI 

Lemma 2.2.4], and the bound is then 

/( xW)-r<minf(l-y^)\ 4L ) (/(,(0))-r + ^||,(»)-,*|| 2 ). (A.10) 

This is the optimal convergence rate for the class Tql and Tnr simultaneously 1251271 . 



B Complexity Analysis 

In this Appendix we prove Theorem |6.1| i.e., we derive the complexity for reaching an e-suboptimal 
solution for the algorithm UPN. The total worst-case complexity is given by a) the complexity for the 
worst case number of restarts and b) the worst-case complexity for a successful termination. 

With a slight abuse of notation in this Appendix, \l kr denotes the &th iterate in the rth restart stage, and 
similarly for Lj ,., L k r , x^' T >, etc. The value Jio o is the initial estimate of the strong convexity parameter 
when no restart has occurred. In the worst case, the heuristic choice in \(>A\ never reduces )l k , such that 
we have \l k r = \Xq t , Then a total of R restarts are required, where 

p£/Jo.o = Mo,R<M <=> /?>log(rt), //0/log(l/p M ). 
In the following analysis we shall make use of the relation 

expf-^f— A <(l-5)"<exp(-^ T ), < 5 < 1, n>0. 
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B.l Termination Complexity 

After sufficiently many restarts (at most R), Hq ,- will be sufficient small in which case (6.9) holds and we 
obtain 

,-'"h||2 



u<w^)iii < n (i - vf ) ( 4 -^- ^ n^(^)i 
<- ( 1 -^)^-^ +2 ^) l|G - ( ^ ))l1 ^ 

< exp (—^=) I ^ - ^ + 110^(^)111, 

V vWw7 / V ^ kr ^ Kr J 

where we have used L, - r < Lj+i, and fl, > ix, + i ,. To guarantee ||Gx (jr'* +1 ' r ')||2 < £ we require the 
latter bound to be smaller than e 2 , i.e., 

Solving for k, we obtain 

k = 0{^Q\ogQ)+0{^/Q\oge- 1 ), (B.l) 



where we have used O ( y 'L k , / >n k ,) = O ( L k +\, r I l*k,r) = O (^/Q) . 



B.2 Restart Complexity 



How many iterations are needed before we can detect that a restart is needed? The restart detection rule 
(6.9) gives 



II Gf U' 



,(*+l/)\i|2 



n 



> i 



U,) \ Mfc/ 2L 0.r 
, k 



116^(^)113 



2L Kr + 2L i ^ 1||G ^ (v(0 . r))|| , 



exp 



k \ ( 4L hT 2L Lr 2L, ,.;/, 



y/Li,/Hl,-lJ \ fll, 2Lq, ' \i\ r 
where we have used L, ■ r < L l+ \ r , L l r < and flj ,- > /ij+i, r . Solving for k, we obtain 



110^(^)111, 



L l,r .W.-M^V L l<T t An,U, \ , \\G^ r {x^ r ))\\l 



-1 log, 

m, ) \ \ Mi,-- Lo, r Mr., 



-log 



(B.2) 



Since we do not terminate but restart, we have ||G^ 4+1 (.v'* +l r ')||2 > e. After r restarts, in order to satisfy 
iU. 21 we must have k of the order 



where 



O ( VQ.) 0(l0gQ r ) + 0( ^Qr) 0(log £- 1 ) , 
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The worst-case number of iterations for running R restarts is then given by 



R 

I 

r=0 



f o [sfopCj o(io g ep;)+o (JopC) o(io g e-') 
Vo) \t (\fpi^) [o(ioge P ;)+o(ioge- 



= o 

= o 

= o 
= o 
= o 



i=0 
R 



where we have used 



v^J |E° [V^J lc( lo ge)+c(ioge- 
^{©(^[©(togflj+oftoge- 1 )]} 

v/g) O (log Q) + O ( Vq) O (log e" 1 ) 
V^logel+Ofv^loge- 1 )' 



S°(VS)=s°(v^O=°lT^ri- c1 "'- 



(B.3) 



B.3 Total Complexity 

The total iteration complexity of UPN is given by jE3J plus jBTj: 

o(v / eioge)+o( v / eioge" 1 ). (b.4) 

It is common to write the iteration complexity in terms of reaching an e-suboptimal solution satisfying 
fix) - f* < £ ■ This is different from the stopping criteria 1 1 G- L , r (x {k+ ' - r) ) 1 1 2 < e or 1 1 G Lt ,. (y '* r) ) 1 1 2 < e 
use d in the UPN algorithm. Consequently, we will derive a relation between e and e. Using Lemmas |5.1| 
and " 



5.2 



in case we stop using 1 1 Gu (y ) 1 1 2 < £ we obtain 

/(* ( * +1 ' r) ) -f<(~- oh) \\GL tr (y^)\\i < l\\G Ltr {y {k - r) )\\l < 

and in case we stop using ||Gx (^ fc+1 ' r ^)||2 < £ 5 we obtain 

f¥»»)-r < (I - 3^) \\o Lk+h M^)\\l < j^c^Mi < 

To return with either /(jc'* +1,r )) — /* < e or /(^'* + ' :r ') — /* < e we require the latter bounds to hold and 
thus select (2/(1 ) e 2 = e. The iteration complexity of the algorithm in terms of e is then 



O 



= o(y/Q]ogQ) +0 (y^loge- 1 ) , 



where we have used O = O (L//x) = O (Q). 
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